Varentropy Decreases Under the Polar Transform

Erdal Arıkan


Abstract —We consider the evolution of variance of entropy 
(varentropy) in the course of a polar transform operation on 
binary data elements (BDEs). A BDE is a pair $(X, Y)$ consisting
of a binary random variable $X$ and an arbitrary side information
random variable $Y$. The varentropy of $(X, Y)$ is defined as
the variance of the random variable $-\log p_{X|Y}(X|Y)$. A polar
transform of order two is a certain mapping that takes two
independent BDEs and produces two new BDEs that are correlated
with each other. It is shown that the sum of the varentropies at 
the output of the polar transform is less than or equal to the 
sum of the varentropies at the input, with equality if and only 
if at least one of the inputs has zero varentropy. This result is 
extended to polar transforms of higher orders and it is shown 
that the varentropy decreases to zero asymptotically when the 
BDEs at the input are independent and identically distributed. 

Index Terms —Polar coding, varentropy, dispersion. 

I. Introduction 

We use the term “varentropy” as an abbreviation for “variance
of the conditional entropy random variable” following
the usage in [1]. In his pioneering work, Strassen [2] showed
that the varentropy is a key parameter for estimating the
performance of optimal block-coding schemes at finite
(non-asymptotic) block-lengths. More recently, the comprehensive
work by Polyanskiy, Poor and Verdú [3] further elucidated
the significance of varentropy (under the name “dispersion”)
and rekindled interest in the subject. In this paper, we study
varentropy in the context of polar coding. Specifically, we
track the evolution of average varentropy in the course of polar
transformation of independent identically distributed (i.i.d.)
BDEs and show that it decreases to zero asymptotically as
the transform size increases. As a side result, we obtain an
alternative derivation of the polarization results of [4], [5].

A. Notation and basic definitions 

Our setting will be that of binary-input memoryless channels
and binary memoryless sources. We treat source and channel
coding problems in a common framework by using the neutral
term “binary data element” (BDE) to cover both. Formally, a
BDE is any pair of random variables $(X, Y)$ where $X$ takes
values over $\mathcal{X} = \{0,1\}$ (not necessarily from the uniform
distribution) and $Y$ takes values over some alphabet $\mathcal{Y}$ which
may be discrete or continuous. A BDE $(X, Y)$ may represent,
in a source-coding setting, a binary data source $X$ that we
wish to compress in the presence of some side information
$Y$; or, it may represent, in a channel-coding setting, a channel
with input $X$ and output $Y$.

Given a BDE $(X, Y)$, the information measures of interest
in the sequel will be the conditional entropy random variable

$$h(X|Y) \triangleq -\log p_{X|Y}(X|Y),$$

the conditional entropy

$$H(X|Y) = \mathbb{E}\, h(X|Y),$$

and the varentropy

$$V(X|Y) = \operatorname{Var}(h(X|Y)).$$

Throughout the paper, we use base-two logarithms. 

The term polar transform is used in this paper to refer to
an operation that takes two independent BDEs $(X_1, Y_1)$ and
$(X_2, Y_2)$ as input, and produces two new BDEs $(U_1; \mathbf{Y})$ and
$(U_2; U_1, \mathbf{Y})$ as output, where $U_1 = X_1 \oplus X_2$, $U_2 = X_2$, and
$\mathbf{Y} = (Y_1, Y_2)$. The notation “$\oplus$” denotes modulo-2 addition.

B. Polar transform and varentropy 

The main result of the paper is the following. 

Theorem 1. The varentropy is nonincreasing under the polar
transform in the sense that, if $(X_1, Y_1)$, $(X_2, Y_2)$ are any two
independent BDEs at the input of the transform and $(U_1; \mathbf{Y})$,
$(U_2; U_1, \mathbf{Y})$ are the BDEs at its output, then

$$V(U_1|\mathbf{Y}) + V(U_2|U_1, \mathbf{Y}) \le V(X_1|Y_1) + V(X_2|Y_2), \tag{1}$$

with equality if and only if (iff) either $V(X_1|Y_1) = 0$ or
$V(X_2|Y_2) = 0$.

For an alternative formulation of the main result, let us 
introduce the following notation: 

$$h_{\mathrm{in},1} = h(X_1|Y_1), \quad h_{\mathrm{in},2} = h(X_2|Y_2), \tag{2}$$

$$h_{\mathrm{out},1} = h(U_1|\mathbf{Y}), \quad h_{\mathrm{out},2} = h(U_2|U_1, \mathbf{Y}). \tag{3}$$

Theorem 1 can be reformulated as follows.

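Theorem 1′. The polar transform of conditional entropy
random variables, $(h_{\mathrm{in},1}, h_{\mathrm{in},2}) \to (h_{\mathrm{out},1}, h_{\mathrm{out},2})$, produces
positively correlated output entropy terms in the sense that

$$\operatorname{Cov}(h_{\mathrm{out},1}, h_{\mathrm{out},2}) \ge 0, \tag{4}$$

with equality iff either $\operatorname{Var}(h_{\mathrm{in},1}) = 0$ or $\operatorname{Var}(h_{\mathrm{in},2}) = 0$.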

This second form makes it clear that any reduction in 
varentropy can be attributed entirely to the creation of a 
positive correlation between the entropy random variables 
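$h_{\mathrm{out},1}$ and $h_{\mathrm{out},2}$ at the output of the polar transform.

Showing the equivalence of the two claims (1) and (4) is a
simple exercise. We have, by the chain rule of entropy,

$$h_{\mathrm{out},1} + h_{\mathrm{out},2} = h_{\mathrm{in},1} + h_{\mathrm{in},2}, \tag{5}$$

hence, $\operatorname{Var}(h_{\mathrm{out},1} + h_{\mathrm{out},2}) = \operatorname{Var}(h_{\mathrm{in},1} + h_{\mathrm{in},2})$. Since $h_{\mathrm{in},1}$
and $h_{\mathrm{in},2}$ are independent, $\operatorname{Var}(h_{\mathrm{in},1} + h_{\mathrm{in},2}) = \operatorname{Var}(h_{\mathrm{in},1}) +
\operatorname{Var}(h_{\mathrm{in},2})$; while $\operatorname{Var}(h_{\mathrm{out},1} + h_{\mathrm{out},2}) = \operatorname{Var}(h_{\mathrm{out},1}) +
\operatorname{Var}(h_{\mathrm{out},2}) + 2\operatorname{Cov}(h_{\mathrm{out},1}, h_{\mathrm{out},2})$. Thus, the claim (1), which
can be written in the equivalent form

$$\operatorname{Var}(h_{\mathrm{out},1}) + \operatorname{Var}(h_{\mathrm{out},2}) \le \operatorname{Var}(h_{\mathrm{in},1}) + \operatorname{Var}(h_{\mathrm{in},2}),$$

is true iff (4) holds.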

A technical question that arises in the sequel is whether the 
varentropy is uniformly bounded across the class of all BDEs. 
This is indeed the case. 

Lemma 1. For any BDE $(X, Y)$, $V(X|Y) \le 2.2434$.

Proof: It suffices to show that the second moment of
$h(X|Y)$ satisfies the given bound.

$$\mathbb{E}[h(X|Y)^2] \le \max_{0 \le x \le 1} \left[x \log^2(x) + (1 - x) \log^2(1 - x)\right]$$

$$\le 2 \max_{0 \le x \le 1} \left[x \log^2(x)\right] = 8 e^{-2} \log^2(e) \approx 2.2434.$$

(A numerical study shows that a more accurate bound on
$V(X|Y)$ is 1.1716, but the present bound will be sufficient
for our purposes.) ■

This bound guarantees that all varentropy terms in this paper 
exist and are bounded; it also guarantees the existence of the 
covariance terms since by the Cauchy-Schwarz inequality we 
have $|\operatorname{Cov}(h_{\mathrm{out},1}, h_{\mathrm{out},2})| \le \sqrt{\operatorname{Var}(h_{\mathrm{out},1}) \operatorname{Var}(h_{\mathrm{out},2})}$.

We will end this part by giving two examples in order to 
illustrate the behavior of varentropy under the polar transform. 
The terminology in both examples reflects a channel coding 
viewpoint; although, each model may also arise in a source 
coding context. 

Example 1. In this example, $(X, Y)$ models a binary symmetric
channel (BSC) with equiprobable inputs and a crossover
probability $0 \le \epsilon \le 1/2$; in other words, $X$ and $Y$ take values
in the set $\{0,1\}$ with

$$p_{X,Y}(x, y) = \begin{cases} \tfrac{1}{2}(1 - \epsilon), & \text{if } x = y; \\ \tfrac{1}{2}\epsilon, & \text{if } x \ne y. \end{cases}$$

Fig. 1 gives a sketch of the varentropy and covariance terms
defined above, with $\operatorname{Var}(h_{\mathrm{in}})$ denoting the common value
of $\operatorname{Var}(h_{\mathrm{in},1})$ and $\operatorname{Var}(h_{\mathrm{in},2})$. (Formulas for computing the
varentropy terms will be given later in the paper.) The
non-negativity of the covariance is an indication that the
varentropy is reduced by the polar transform.


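Example 2. Here, $(X, Y)$ represents a binary erasure channel
(BEC) with equiprobable inputs and an erasure probability $\epsilon$.
In other words, $X$ takes values in $\{0,1\}$, $Y$ takes values in
$\{0,1,2\}$, and

$$p_{X,Y}(x, y) = \begin{cases} \tfrac{1}{2}(1 - \epsilon), & \text{if } x = y; \\ \tfrac{1}{2}\epsilon, & \text{if } y = 2. \end{cases}$$

In this case, there exist simple formulas for the varentropies:
$\operatorname{Var}(h_{\mathrm{in},1}) = \operatorname{Var}(h_{\mathrm{in},2}) = \operatorname{Var}(h_{\mathrm{in}}) = \epsilon(1 - \epsilon)$,
$\operatorname{Var}(h_{\mathrm{out},1}) = (2\epsilon - \epsilon^2)(1 - \epsilon)^2$, and
$\operatorname{Var}(h_{\mathrm{out},2}) = \epsilon^2(1 - \epsilon^2)$. The covariance is
given by $\operatorname{Cov}(h_{\mathrm{out},1}, h_{\mathrm{out},2}) = \epsilon^2(1 - \epsilon)^2$. The corresponding
curves are plotted in Fig. 2.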
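The closed-form expressions of Example 2 can be checked by direct enumeration. The following Python sketch is an illustration only, not part of the original argument; the helper names `entropy_rv`, `polar_stats`, `bec`, and `bsc` are assumptions introduced here. It builds the joint distribution of two independent discrete BDEs, applies $U_1 = X_1 \oplus X_2$, $U_2 = X_2$, and computes the variances and the covariance of the entropy random variables exactly.

```python
import itertools
import math


def entropy_rv(bde):
    """Map each outcome (x, y) of a BDE to h = -log2 p(x|y)."""
    p_y = {}
    for (x, y), prob in bde.items():
        p_y[y] = p_y.get(y, 0.0) + prob
    return {(x, y): -math.log2(prob / p_y[y])
            for (x, y), prob in bde.items() if prob > 0}


def polar_stats(bde1, bde2):
    """Exact variances and covariance of h_in and h_out for two independent BDEs.

    bde1, bde2: dicts mapping (x, y) -> P(X = x, Y = y), with x in {0, 1}.
    """
    out1, out2, outcomes = {}, {}, []
    for ((x1, y1), a), ((x2, y2), b) in itertools.product(bde1.items(), bde2.items()):
        if a * b == 0:
            continue
        u1, u2 = x1 ^ x2, x2              # polar transform of order two
        k1 = (u1, (y1, y2))               # outcome of the BDE (U1; Y)
        k2 = (u2, (u1, y1, y2))           # outcome of the BDE (U2; U1, Y)
        out1[k1] = out1.get(k1, 0.0) + a * b
        out2[k2] = out2.get(k2, 0.0) + a * b
        outcomes.append((a * b, (x1, y1), (x2, y2), k1, k2))

    h = [entropy_rv(bde1), entropy_rv(bde2), entropy_rv(out1), entropy_rv(out2)]
    # Each row: (probability, h_in1, h_in2, h_out1, h_out2).
    rows = [(w,) + tuple(h[i][k] for i, k in enumerate(keys))
            for (w, *keys) in outcomes]

    def mean(i):
        return sum(r[0] * r[i] for r in rows)

    def cov(i, j):
        return sum(r[0] * r[i] * r[j] for r in rows) - mean(i) * mean(j)

    return {"var_in1": cov(1, 1), "var_in2": cov(2, 2),
            "var_out1": cov(3, 3), "var_out2": cov(4, 4),
            "cov_out": cov(3, 4)}


def bec(eps):
    """BEC with equiprobable inputs; the symbol 2 denotes an erasure."""
    return {(0, 0): 0.5 * (1 - eps), (1, 1): 0.5 * (1 - eps),
            (0, 2): 0.5 * eps, (1, 2): 0.5 * eps}


def bsc(eps):
    """BSC with equiprobable inputs and crossover probability eps."""
    return {(x, y): 0.5 * ((1 - eps) if x == y else eps)
            for x in (0, 1) for y in (0, 1)}
```

Calling `polar_stats(bec(eps), bec(eps))` and comparing the returned values with the formulas above gives a quick consistency check; the same routine applies to any pair of finite-alphabet BDEs.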



Fig. 1. Variance and covariance of entropy for BSC under polar transform. 



Fig. 2. Variance and covariance of entropy for BEC under polar transform. 


C. Organization 

The rest of the paper is organized as follows. In Section II,
we define two canonical representations for a BDE $(X, Y)$ that
eliminate irrelevant details from problem description and
simplify the analysis. In Section III, we review some basic facts
about the covariance function that are needed in the remainder
of the paper. Section IV contains the proof of Theorem 1.
Section V considers the behavior of varentropy under higher-order
polar transforms and contains a self-contained proof of
the main polarization result of [4].

Throughout, we will often write $\bar{p}$ to denote $1 - p$ for a real
number $0 \le p \le 1$. For $0 \le p, q \le 1$, we will write $p * q$ to
denote the convolution $p\bar{q} + \bar{p}q$.

II. Canonical representations 

The information measures of interest relating to a given
BDE $(X, Y)$ are determined solely by the joint probability
distribution of $(X, Y)$; the specific forms of the alphabets $\mathcal{X}$
and $\mathcal{Y}$ play no role. We have already fixed $\mathcal{X}$ as $\{0,1\}$ so
as to have a standard representation for $X$. It is possible and
desirable to re-parametrize the problem, if necessary, so that
$\mathcal{Y}$ also has a canonical form. Such canonical representations
have been given for Binary Memoryless Symmetric (BMS)
channels in [6]. The class of BDEs $(X, Y)$ under consideration
here is more general than the class of BMS channels, but
similar ideas apply. We will give two canonical representations
for BDEs, which we will call the α-representation and
the β-representation. The α-representation replaces $\mathcal{Y}$ with a
canonical alphabet $\mathcal{A} \subset [0,1]$, and has the property of being
“lossless”. The β-representation replaces $\mathcal{Y}$ with $\mathcal{B} \subset [0,1/2]$;
it is “lossy”, but happens to be more convenient than the
α-representation for purposes of proving Theorem 1.

A. The α-representation

Given a BDE $(X, Y)$, we associate to each $y \in \mathcal{Y}$ the
parameter

$$\alpha(y) = \alpha_{X|Y}(y) = p_{X|Y}(0|y)$$

and define $A = \alpha(Y)$. The random variable $A$ takes values
in the set $\mathcal{A} = \{\alpha(y) : y \in \mathcal{Y}\}$, which is always a subset
of $[0,1]$. We refer to $A$ as the α-representation of $(X, Y)$.
The α-representation provides economy by using a canonical
alphabet $\mathcal{A}$ in which any two symbols $y, y' \in \mathcal{Y}$ are merged
into a common symbol $a$ whenever $\alpha(y) = \alpha(y') = a$.

We give some examples to illustrate the α-representation.
For the BSC of Example 1, we have $\alpha(0) = 1 - \epsilon$, $\alpha(1) = \epsilon$,
$\mathcal{A} = \{\epsilon, 1 - \epsilon\}$. In the case of the BEC of Example 2, we have
$\alpha(0) = 1$, $\alpha(1) = 0$, $\alpha(2) = 1/2$, $\mathcal{A} = \{0, 1/2, 1\}$. As a third
example, consider the channel $y = (-1)^x c + z$ where $c > 0$
is a constant and $z \sim N(0, 1)$ is a zero-mean unit-variance
additive Gaussian noise, independent of $x$. In this case, we
have

$$\alpha(y) = \frac{e^{-(y-c)^2/2}}{e^{-(y-c)^2/2} + e^{-(y+c)^2/2}} = \frac{1}{1 + e^{-2cy}},$$

giving $\mathcal{A} = (0, 1)$.

The α-representation provides “sufficient statistics” for
computing the information measures of interest to us. To
illustrate this, let $(X, Y)$ be an arbitrary BDE and let $A = \alpha(Y)$ be
its α-representation. Let $F_A$ denote the cumulative distribution
function (CDF) of $A$.

The conditional entropy random variable is given by

$$h(X|Y) = h(X|A) = \begin{cases} -\log A, & X = 0, \\ -\log \bar{A}, & X = 1. \end{cases} \tag{6}$$

Hence, the conditional entropy can be calculated as

$$H(X|Y) = \mathbb{E}\, h(X|Y) = \mathbb{E}\, h(X|A) = \mathbb{E}_A \mathbb{E}_{X|A}\, h(X|A)
= \mathbb{E}_A \mathcal{H}(A) = \mathbb{E}\, \mathcal{H}(A) = \int_0^1 \mathcal{H}(a)\, dF_A(a), \tag{7}$$

where $\mathcal{H}(a) = -a \log a - \bar{a} \log \bar{a}$, $a \in [0,1]$, is the binary
entropy function. Likewise, the varentropy is given by

$$V(X|Y) = V(X|A) = \mathbb{E}\, \mathcal{H}_2(A) - [\mathbb{E}\, \mathcal{H}(A)]^2, \tag{8}$$

where $\mathcal{H}_2(a) = a \log^2 a + \bar{a} \log^2 \bar{a}$ is the conditional
second moment of $h(X|A)$ given $A = a$, and

$$\mathbb{E}\, \mathcal{H}_2(A) = \int_0^1 \mathcal{H}_2(a)\, dF_A(a).$$

Finally, we note that $H(X) = \mathcal{H}(p_X(0)) = \mathcal{H}(\mathbb{E} A)$.
Thus, all information measures of interest in this paper can
be computed given knowledge of the distribution of $A$.
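As an illustration of (7) and (8), the following Python sketch computes $H(X|Y)$ and $V(X|Y)$ from a discrete distribution of $A$. It is a minimal sketch under the assumption that $A$ takes finitely many values; the function names are introduced here for illustration, and $\mathcal{H}_2(a)$ is taken as the conditional second moment $a \log^2 a + \bar{a} \log^2 \bar{a}$.

```python
import math


def binary_entropy(a):
    """Binary entropy H(a) in bits, with H(0) = H(1) = 0."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)


def second_moment(a):
    """H_2(a): second moment of h(X|A) given A = a, in bits squared."""
    if a in (0.0, 1.0):
        return 0.0
    return a * math.log2(a) ** 2 + (1 - a) * math.log2(1 - a) ** 2


def entropy_and_varentropy(alpha_dist):
    """Return (H(X|Y), V(X|Y)) for a dict mapping a -> P(A = a)."""
    mean_h = sum(w * binary_entropy(a) for a, w in alpha_dist.items())
    mean_h2 = sum(w * second_moment(a) for a, w in alpha_dist.items())
    return mean_h, mean_h2 - mean_h ** 2


def bsc_alpha(eps):
    """Alpha-representation of the BSC of Example 1 (0 < eps < 1/2)."""
    return {eps: 0.5, 1 - eps: 0.5}


def bec_alpha(eps):
    """Alpha-representation of the BEC of Example 2."""
    return {1.0: 0.5 * (1 - eps), 0.0: 0.5 * (1 - eps), 0.5: eps}
```

For the BSC this reproduces $H(X|Y) = \mathcal{H}(\epsilon)$, and for the BEC it reproduces $H(X|Y) = \epsilon$ and $V(X|Y) = \epsilon(1 - \epsilon)$.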

B. The β-representation

Although the α-representation eliminates much of the
irrelevant detail from $(X, Y)$, there is need for an even more
compact representation for the type of problems considered in
the sequel. This more compact representation is obtained by
associating to each $y \in \mathcal{Y}$ the parameter

$$\beta(y) = \beta_{X|Y}(y) = \min\{p_{X|Y}(0|y), p_{X|Y}(1|y)\}.$$

We define the β-representation of $(X, Y)$ as the random
variable $B = \beta(Y)$. We denote the range of $B$ by $\mathcal{B} = \{\beta(y) :
y \in \mathcal{Y}\}$ and note that $\mathcal{B} \subset [0,1/2]$.

The β-representation can be obtained from the
α-representation by

$$\beta(y) = \min\{\alpha(y), 1 - \alpha(y)\}, \quad B = \min\{A, \bar{A}\};$$

but, in general, the α-representation cannot be recovered from
the β-representation.

For the BSC of Example 1, we have $\beta(0) = \beta(1) = \epsilon$,
giving $\mathcal{B} = \{\epsilon\}$. For the BEC of Example 2, we have $\beta(0) =
\beta(1) = 0$, $\beta(2) = 1/2$, and $\mathcal{B} = \{0, 1/2\}$. For the binary-input
additive Gaussian noise channel, we have

$$\beta(y) = \frac{1}{1 + e^{2c|y|}},$$

with $\mathcal{B} = (0, 1/2]$.

As is evident from (6), the conditional entropy random
variable $h(X|Y)$ cannot be expressed as a function of $(X, B)$.
However, if the CDF $F_B$ of $B$ is known, we can compute
$H(X|Y)$ and $V(X|Y)$ by the following formulas that are
analogous to (7) and (8):

$$H(X|Y) = \mathbb{E}\, \mathcal{H}(B), \quad V(X|Y) = \mathbb{E}\, \mathcal{H}_2(B) - [\mathbb{E}\, \mathcal{H}(B)]^2.$$

To see that $B$ is less than a “sufficient statistic” for
information measures, one may note that $H(X)$ is not determined by
knowledge of $F_B$ alone. For example, for a BDE $(X, Y)$ with
$\Pr(X = Y) = 1$, we have $\Pr(B = 0) = 1$, independently of
$p_X(0)$.

Despite its shortcomings, the β-representation will be useful
for our purposes due to the fact that the binary entropy function
$\mathcal{H}(p)$ is monotone over $p \in [0,1/2]$ but not over $p \in [0,1]$.
Thus, the random variable $\mathcal{H}(B)$ is a monotone function of $B$
over the range of $B$, but $\mathcal{H}(A)$ is not necessarily so over the
range of $A$. This monotonicity will be important in proving
certain correlation inequalities later in the paper.

C. Classification of binary data elements 

Table I gives a classification of a BDE $(X, Y)$ in terms
of the properties of $B = \beta(Y)$. The classification allows an
erasing BDE to be extreme as a special case.

For a pure $(X, Y)$, we obtain from (7) and (8) that
$H(X|Y) = \mathcal{H}(b)$, $V(X|Y) = b(1 - b) \log^2\!\left(\frac{b}{1 - b}\right)$,
TABLE I
Classification of BDEs

Type                   Property
pure                   $P(B = b) = 1$ for some $b \in [0,1/2]$
extreme                $P(B = 0) = 1$ or $P(B = 1/2) = 1$
perfect                $P(B = 0) = 1$
purely random (p.r.)   $P(B = 1/2) = 1$
erasing                $P(B = 0) + P(B = 1/2) = 1$

where $b$ is the value that $B = \beta(Y)$ takes with probability 1.
A simple corollary to this is the following characterization of 
an extreme BDE. 


Proposition 1. Let $(X, Y)$ be a BDE and $B = \beta(Y)$. The
following three statements are equivalent: (i) $(X, Y)$ is extreme,
(ii) $H(X|Y) = 0$ or $H(X|Y) = 1$, (iii) $V(X|Y) = 0$.

We omit the proof since it is immediate from the above
formulas for $H(X|Y)$ and $V(X|Y)$ for a pure BDE.

For an erasing $(X, Y)$, it is easily seen that

$$H(X|Y) = p, \quad V(X|Y) = p(1 - p)$$

where $p = P[\beta(Y) = 1/2]$ is the erasure probability.

Parenthetically, we note that while the entropy function
satisfies $H(X|Y) \le H(X)$, there is no such general relationship
between $V(X|Y)$ and $V(X)$. For an erasing $(X, Y)$ with
$p_X(1) = 1 - p_X(0) = q$ and erasure probability $p$, we have
$V(X) = q(1 - q) \log^2[q/(1 - q)]$ while $V(X|Y) = p(1 - p)$.
Either $V(X) < V(X|Y)$ or $V(X) > V(X|Y)$ is possible
depending on $q$ and $p$.

D. Canonical representations under polar transform

In this part, we explore how the α- and β-representations
evolve as they undergo a polar transform. Let us return to the
setting of Sect. I-B. Let $(U_1; \mathbf{Y})$ and $(U_2; U_1, \mathbf{Y})$ denote the
two BDEs obtained from a pair of independent BDEs $(X_1, Y_1)$
and $(X_2, Y_2)$ by the polar transform. Let $h_{\mathrm{in},1}$, $h_{\mathrm{in},2}$, $h_{\mathrm{out},1}$,
and $h_{\mathrm{out},2}$ denote the entropy random variables at the input
and output of the polar transform. For $i = 1, 2$, let $A_{\mathrm{in},i}$ and
$B_{\mathrm{in},i}$ be the α- and β-representations for the $i$th BDE at the
input side; and let $A_{\mathrm{out},i}$ and $B_{\mathrm{out},i}$ be those for the $i$th BDE
at the output side. Let the sample values of these variables be
denoted by lower-case letters, such as $a_{\mathrm{in},1}$ for $A_{\mathrm{in},1}$, $b_{\mathrm{in},1}$ for
$B_{\mathrm{in},1}$, etc.

Proposition 2. The α-parameters at the input and output of
a polar transform are related by

$$A_{\mathrm{out},1} = A_{\mathrm{in},1} * \bar{A}_{\mathrm{in},2}, \tag{9}$$

$$A_{\mathrm{out},2} = \begin{cases} A_{\mathrm{in},1} A_{\mathrm{in},2} / (A_{\mathrm{in},1} * \bar{A}_{\mathrm{in},2}), & U_1 = 0, \\ \bar{A}_{\mathrm{in},1} A_{\mathrm{in},2} / (A_{\mathrm{in},1} * A_{\mathrm{in},2}), & U_1 = 1. \end{cases} \tag{10}$$

Remark 1. In (10), the event $\{A_{\mathrm{in},1} * \bar{A}_{\mathrm{in},2} = 0\}$ leads
to an indeterminate form $A_{\mathrm{out},2} = 0/0$, but the conditional
probability of $\{A_{\mathrm{in},1} * \bar{A}_{\mathrm{in},2} = 0\}$ given $\{U_1 = 0\}$ is zero:
$A_{\mathrm{in},1} * \bar{A}_{\mathrm{in},2} = 0$ implies $(A_{\mathrm{in},1}, A_{\mathrm{in},2}) \in \{(0,1), (1,0)\}$, which
in turn implies $(X_1, X_2) \in \{(1,0), (0,1)\}$, giving $U_1 = 1$.
Similarly, the event $\{A_{\mathrm{in},1} * A_{\mathrm{in},2} = 0\}$ is incompatible with
$\{U_1 = 1\}$.

Proof: For a fixed $\mathbf{Y} = (y_1, y_2)$, the sample values of
$A_{\mathrm{out},1}$ are given by

$$a_{\mathrm{out},1}(y_1, y_2) = p_{U_1|Y_1,Y_2}(0|y_1, y_2)$$

$$= \sum_{u_2} p_{U_1,U_2|Y_1,Y_2}(0, u_2|y_1, y_2)$$

$$= \sum_{u_2} p_{X_1|Y_1}(u_2|y_1)\, p_{X_2|Y_2}(u_2|y_2)$$

$$= a_{\mathrm{in},1}(y_1) * \bar{a}_{\mathrm{in},2}(y_2).$$


From this, the first statement (9) follows. The second statement
(10) can be obtained by similar reasoning. ■
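The relations of Proposition 2 can be checked numerically for given sample values. The Python sketch below is an illustration under stated assumptions: it takes $p * q = p\bar{q} + \bar{p}q$, reads (9) as $A_{\mathrm{out},1} = A_{\mathrm{in},1} * \bar{A}_{\mathrm{in},2}$ (the probability that $U_1 = 0$ given the observations), and compares the closed forms with a brute-force enumeration over $(x_1, x_2)$. The function names are hypothetical.

```python
def conv(p, q):
    """Convolution p * q = p(1 - q) + (1 - p)q."""
    return p * (1 - q) + (1 - p) * q


def alpha_out_formula(a1, a2, u1):
    """Closed-form (a_out1, a_out2) for sample values a1, a2 and a given u1.

    Assumes the relevant denominator is non-zero (see Remark 1).
    """
    a_out1 = conv(a1, 1 - a2)                     # P(U1 = 0 | y)
    if u1 == 0:
        a_out2 = a1 * a2 / conv(a1, 1 - a2)       # P(U2 = 0 | U1 = 0, y)
    else:
        a_out2 = (1 - a1) * a2 / conv(a1, a2)     # P(U2 = 0 | U1 = 1, y)
    return a_out1, a_out2


def alpha_out_bruteforce(a1, a2, u1):
    """Same quantities by enumerating (x1, x2) with P(x_i = 0 | y_i) = a_i."""
    p_x1 = {0: a1, 1: 1 - a1}
    p_x2 = {0: a2, 1: 1 - a2}
    p_u = {}
    for x1 in (0, 1):
        for x2 in (0, 1):
            key = (x1 ^ x2, x2)                   # (u1, u2)
            p_u[key] = p_u.get(key, 0.0) + p_x1[x1] * p_x2[x2]
    p_u1_is_0 = p_u[(0, 0)] + p_u[(0, 1)]
    p_u1 = p_u[(u1, 0)] + p_u[(u1, 1)]
    return p_u1_is_0, p_u[(u1, 0)] / p_u1
```

Agreement of the two functions on a few interior points such as `(0.2, 0.7)` is a quick sanity check of the bar placement in (9) and (10).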

The above result leads to the following “density evolution”
formula. Let $F_{\mathrm{in},1}$, $F_{\mathrm{in},2}$, $F_{\mathrm{out},1}$, and $F_{\mathrm{out},2}$ be the CDFs of
$A_{\mathrm{in},1}$, $A_{\mathrm{in},2}$, $A_{\mathrm{out},1}$, and $A_{\mathrm{out},2}$, respectively.

Proposition 3. The CDFs of the α-parameters at the output of
a polar transform are related to the CDFs of the α-parameters
at the input by

$$F_{\mathrm{out},1}(a) = \iint_{a_1 * \bar{a}_2 \le a} dF_{\mathrm{in},1}(a_1)\, dF_{\mathrm{in},2}(a_2),$$

$$F_{\mathrm{out},2}(a) = \iint_{\frac{a_1 a_2}{a_1 * \bar{a}_2} \le a} (a_1 * \bar{a}_2)\, dF_{\mathrm{in},1}(a_1)\, dF_{\mathrm{in},2}(a_2)
+ \iint_{\frac{\bar{a}_1 a_2}{a_1 * a_2} \le a} (a_1 * a_2)\, dF_{\mathrm{in},1}(a_1)\, dF_{\mathrm{in},2}(a_2).$$

These density evolution equations follow from (9) and (10).
In the expression for $F_{\mathrm{out},2}(a)$, the integrands $(a_1 * \bar{a}_2)$ and
$(a_1 * a_2)$ correspond to the conditional probability of $U_1$ being
0 and 1, respectively, given that $A_{\mathrm{in},1} = a_1$ and $A_{\mathrm{in},2} = a_2$.
We omit the proof for brevity.

For the β-parameters, the analogous result to Proposition 2
is as follows.

$$B_{\mathrm{out},1} = \gamma(B_{\mathrm{in},1} * B_{\mathrm{in},2}),$$

$$B_{\mathrm{out},2} = \begin{cases} \gamma\big(B_{\mathrm{in},1} B_{\mathrm{in},2} / (B_{\mathrm{in},1} * \bar{B}_{\mathrm{in},2})\big), & T \ge 0; \\ \gamma\big(B_{\mathrm{in},1} \bar{B}_{\mathrm{in},2} / (B_{\mathrm{in},1} * B_{\mathrm{in},2})\big), & T < 0, \end{cases}$$

where $\gamma(x) = \min\{x, 1 - x\}$ for any $x \in [0,1]$ and $T =
(1/2 - U_1)(1/2 - A_{\mathrm{in},1})(1/2 - A_{\mathrm{in},2})$. We omit the derivation
of these evolution formulas for the β-parameters since they
will not be used in the sequel. The main point to note here
is that the knowledge of $(B_{\mathrm{in},1}, B_{\mathrm{in},2}, U_1)$ is not sufficient to
determine $T$, hence not sufficient to determine $B_{\mathrm{out},2}$. So, there
is no counterpart of Proposition 3 for the β-parameters.

Although there is no general formula for tracking the 
evolution of the /3-parameters through the polar transform, 
there is an important exceptional case in which we can track 
that evolution, namely, the case where at least one of the BDEs 
at the transform input is extreme. This special case will be 
important in the sequel, hence we consider it in some detail. 
TABLE II
Polar transform of extreme BDEs

$B_{\mathrm{in},1}$    $B_{\mathrm{in},2}$    $B_{\mathrm{out},1}$    $B_{\mathrm{out},2}$
perfect      any          $B_{\mathrm{in},2}$     perfect
p.r.         any          p.r.          $B_{\mathrm{in},2}$
any          perfect      $B_{\mathrm{in},1}$     perfect
any          p.r.         p.r.          $B_{\mathrm{in},1}$

Table II summarizes the evolution of the β-parameters for
all possible situations in which at least one of the input BDEs
is extreme. (In the table “p.r.” stands for “purely random”.)


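The following proposition states more precisely the way the
β-parameters evolve when one of the input BDEs is extreme.

Proposition 4. If $B_{\mathrm{in},1}$ is extreme, then the β-parameters at
the output are given by

$$B_{\mathrm{out},1} = \begin{cases} B_{\mathrm{in},2}, & \text{if } B_{\mathrm{in},1} \text{ is perfect;} \\ \tfrac{1}{2}, & \text{if } B_{\mathrm{in},1} \text{ is p.r.,} \end{cases} \tag{11}$$

$$B_{\mathrm{out},2} = \begin{cases} 0, & \text{if } B_{\mathrm{in},1} \text{ is perfect;} \\ B_{\mathrm{in},2}, & \text{if } B_{\mathrm{in},1} \text{ is p.r.} \end{cases} \tag{12}$$

If $B_{\mathrm{in},2}$ is extreme, then (11) and (12) hold after interchanging
$B_{\mathrm{in},1}$ and $B_{\mathrm{in},2}$.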

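Proof: Suppose $B_{\mathrm{in},1} = 0$ (perfect); then $A_{\mathrm{in},1}$ can only
take the values 0 and 1, and we obtain from (9) that

$$A_{\mathrm{out},1} = A_{\mathrm{in},1} * \bar{A}_{\mathrm{in},2} = \begin{cases} \bar{A}_{\mathrm{in},2}, & A_{\mathrm{in},1} = 0; \\ A_{\mathrm{in},2}, & A_{\mathrm{in},1} = 1. \end{cases}$$

Thus, $B_{\mathrm{out},1} = \min(A_{\mathrm{out},1}, \bar{A}_{\mathrm{out},1}) = \min(A_{\mathrm{in},2}, \bar{A}_{\mathrm{in},2}) =
B_{\mathrm{in},2}$, completing the proof of the first case in (11). We skip
the proof of the remaining three cases since they follow by
similar reasoning. ■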


III. Covariance review 

In this part, we collect some basic facts about the covariance 
function, which we will need in the following sections. The 
first result is the following formula for splitting a covariance 
into two parts. 

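Lemma 2. Let $\mathbf{S}$, $\mathbf{T}$ be jointly distributed random vectors over
$\mathbb{R}^m$ and $\mathbb{R}^n$, respectively. Let $f, g : \mathbb{R}^{m+n} \to \mathbb{R}$ be functions
such that $\operatorname{Cov}[f(\mathbf{S}, \mathbf{T}), g(\mathbf{S}, \mathbf{T})]$ exists, i.e., $\mathbb{E} f(\mathbf{S}, \mathbf{T}) g(\mathbf{S}, \mathbf{T})$,
$\mathbb{E} f(\mathbf{S}, \mathbf{T})$, and $\mathbb{E} g(\mathbf{S}, \mathbf{T})$ all exist. Then,

$$\operatorname{Cov}[f(\mathbf{S}, \mathbf{T}), g(\mathbf{S}, \mathbf{T})] = \mathbb{E}_{\mathbf{T}} \operatorname{Cov}_{\mathbf{S}|\mathbf{T}}[f(\mathbf{S}, \mathbf{T}), g(\mathbf{S}, \mathbf{T})]
+ \operatorname{Cov}_{\mathbf{T}}[\mathbb{E}_{\mathbf{S}|\mathbf{T}} f(\mathbf{S}, \mathbf{T}), \mathbb{E}_{\mathbf{S}|\mathbf{T}}\, g(\mathbf{S}, \mathbf{T})]. \tag{13}$$

Although this is an elementary result, we give a proof here
mainly for illustrating the notation. Our proof follows [7].

Proof: We will omit the arguments of the functions for
brevity.

$$\operatorname{Cov}(f, g) = \mathbb{E}_{\mathbf{S},\mathbf{T}}\, fg - \mathbb{E}_{\mathbf{S},\mathbf{T}} f \cdot \mathbb{E}_{\mathbf{S},\mathbf{T}}\, g$$

$$= \mathbb{E}_{\mathbf{T}} \mathbb{E}_{\mathbf{S}|\mathbf{T}}\, fg - \mathbb{E}_{\mathbf{T}}[\mathbb{E}_{\mathbf{S}|\mathbf{T}} f \cdot \mathbb{E}_{\mathbf{S}|\mathbf{T}}\, g]
+ \mathbb{E}_{\mathbf{T}}[\mathbb{E}_{\mathbf{S}|\mathbf{T}} f \cdot \mathbb{E}_{\mathbf{S}|\mathbf{T}}\, g] - \mathbb{E}_{\mathbf{T}} \mathbb{E}_{\mathbf{S}|\mathbf{T}} f \cdot \mathbb{E}_{\mathbf{T}} \mathbb{E}_{\mathbf{S}|\mathbf{T}}\, g$$

$$= \mathbb{E}_{\mathbf{T}} \operatorname{Cov}_{\mathbf{S}|\mathbf{T}}(f, g) + \operatorname{Cov}_{\mathbf{T}}(\mathbb{E}_{\mathbf{S}|\mathbf{T}} f, \mathbb{E}_{\mathbf{S}|\mathbf{T}}\, g).$$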
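The decomposition (13) can be verified numerically on a small discrete example. The following Python sketch is illustrative only; the function name `cov_decomposition` and the representation of the joint distribution as a dictionary with strictly positive entries are assumptions made here. It returns the total covariance together with the "within" term $\mathbb{E}_{\mathbf{T}} \operatorname{Cov}_{\mathbf{S}|\mathbf{T}}$ and the "between" term $\operatorname{Cov}_{\mathbf{T}}$ of the conditional means.

```python
def cov_decomposition(pmf, f, g):
    """Return (total, within, between) for a joint pmf {(s, t): prob}.

    total   = Cov[f(S,T), g(S,T)]
    within  = E_T Cov_{S|T}[f, g]
    between = Cov_T[E_{S|T} f, E_{S|T} g]
    """
    def expect(fn):
        return sum(p * fn(s, t) for (s, t), p in pmf.items())

    total = expect(lambda s, t: f(s, t) * g(s, t)) - expect(f) * expect(g)

    # Marginal distribution of T.
    p_t = {}
    for (_, t), p in pmf.items():
        p_t[t] = p_t.get(t, 0.0) + p

    within = 0.0
    mean_f, mean_g = {}, {}
    for t, pt in p_t.items():
        cond = {s: p / pt for (s, tt), p in pmf.items() if tt == t}
        ef = sum(w * f(s, t) for s, w in cond.items())
        eg = sum(w * g(s, t) for s, w in cond.items())
        efg = sum(w * f(s, t) * g(s, t) for s, w in cond.items())
        within += pt * (efg - ef * eg)
        mean_f[t], mean_g[t] = ef, eg

    e_mf = sum(p_t[t] * mean_f[t] for t in p_t)
    e_mg = sum(p_t[t] * mean_g[t] for t in p_t)
    between = sum(p_t[t] * mean_f[t] * mean_g[t] for t in p_t) - e_mf * e_mg
    return total, within, between
```

For any choice of `pmf`, `f`, and `g`, the first returned value should equal the sum of the other two up to floating-point error.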


The second result we recall is the following inequality. 

Lemma 3 (Chebyshev’s covariance inequality). Let $X$ be
a random variable taking values over $\mathbb{R}$ and let $f, g :
\mathbb{R} \to \mathbb{R}$ be any two nondecreasing functions. Suppose
that $\operatorname{Cov}(f(X), g(X))$ exists, i.e., $\mathbb{E} f(X) g(X)$, $\mathbb{E} f(X)$, and
$\mathbb{E} g(X)$ all exist. Then,

$$\operatorname{Cov}(f(X), g(X)) \ge 0. \tag{14}$$

Proof: Let $X'$ be an independent copy of $X$. Let $\mathbb{E}$ and
$\mathbb{E}'$ denote expectation with respect to $X$ and $X'$, respectively.
The proof follows readily from the following identity whose
proof can be found in [8, p. 43],

$$\operatorname{Cov}(f(X), g(X)) = \mathbb{E} f(X) g(X) - \mathbb{E} f(X)\, \mathbb{E} g(X)
= \tfrac{1}{2} \mathbb{E}' \mathbb{E}\left[(f(X) - f(X'))(g(X) - g(X'))\right].$$

Now note that for any $x, x' \in \mathbb{R}$, $f(x) - f(x')$ and $g(x) - g(x')$
have the same sign since both $f$ and $g$ are nondecreasing.
Thus, $(f(x) - f(x'))(g(x) - g(x')) \ge 0$, and non-negativity
of the covariance follows. ■
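The identity used in the proof of Lemma 3 is easy to confirm on a finite example. The Python sketch below is a minimal illustration with hypothetical function names; it evaluates the covariance directly and through the independent-copy form, for a discrete random variable given as a dictionary of probabilities.

```python
def cov_direct(pmf, f, g):
    """Cov(f(X), g(X)) = E[f g] - E[f] E[g] for a pmf {x: prob}."""
    ef = sum(p * f(x) for x, p in pmf.items())
    eg = sum(p * g(x) for x, p in pmf.items())
    efg = sum(p * f(x) * g(x) for x, p in pmf.items())
    return efg - ef * eg


def cov_via_copy(pmf, f, g):
    """(1/2) E' E[(f(X) - f(X'))(g(X) - g(X'))] with X' an independent copy."""
    total = 0.0
    for x, p in pmf.items():
        for xp, pp in pmf.items():
            total += p * pp * (f(x) - f(xp)) * (g(x) - g(xp))
    return 0.5 * total
```

When `f` and `g` are both nondecreasing, every summand in `cov_via_copy` is non-negative, which is exactly the argument of the proof.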

IV. Proof of Theorem 1

Let us recall the setting of Theorem 1. We have two
independent BDEs $(X_1, Y_1)$ and $(X_2, Y_2)$ as inputs of a polar
transform, and two BDEs $(U_1; \mathbf{Y})$ and $(U_2; U_1, \mathbf{Y})$ at the
output, with $U_1 = X_1 \oplus X_2$, $U_2 = X_2$, and $\mathbf{Y} = (Y_1, Y_2)$.
Associated with these BDEs are the conditional entropy random
variables $h_{\mathrm{in},1}$, $h_{\mathrm{in},2}$, $h_{\mathrm{out},1}$, and $h_{\mathrm{out},2}$, as defined by (2)
and (3). We will carry out the proof mostly in terms of the
canonical parameters $A_i = \alpha_{X_i|Y_i}(Y_i)$ and $B_i = \beta_{X_i|Y_i}(Y_i)$,
$i = 1, 2$. For shorthand, we will often write $\mathbf{X} = (X_1, X_2)$,
$\mathbf{U} = (U_1, U_2)$, $\mathbf{A} = (A_1, A_2)$, and $\mathbf{B} = (B_1, B_2)$.

We will carry out our calculations in the probability space
defined by the joint ensemble $(\mathbf{X}, \mathbf{Y})$. Probabilities over this
ensemble will be denoted by $P(\cdot)$ and expectations by $\mathbb{E}[\cdot]$.
Partial and conditional expectations and covariances will be
denoted by $\mathbb{E}_{\mathbf{Y}}$, $\mathbb{E}_{\mathbf{X}|\mathbf{Y}}$, $\operatorname{Cov}_{\mathbf{Y}}$, $\operatorname{Cov}_{\mathbf{X}|\mathbf{Y}}$, etc. Due to the 1-1
nature of the correspondence between $\mathbf{U}$ and $\mathbf{X}$, expectation
and covariance operators such as $\mathbb{E}_{\mathbf{U}|\mathbf{Y}}$ and $\operatorname{Cov}_{\mathbf{U}|\mathbf{Y}}$ will be
equivalent to $\mathbb{E}_{\mathbf{X}|\mathbf{Y}}$ and $\operatorname{Cov}_{\mathbf{X}|\mathbf{Y}}$, respectively. We will prefer
to use expectation operators in terms of the primary variables
$\mathbf{X}$ and $\mathbf{Y}$ rather than the secondary (derived) variables such as
$\mathbf{U}$, $\mathbf{A}$, $\mathbf{B}$, to emphasize that the underlying space is $(\mathbf{X}, \mathbf{Y})$.
We note that, due to the independence of $Y_1$ and $Y_2$, $A_1$ and
$A_2$ are independent; likewise, $B_1$ and $B_2$ are independent.

A. Covariance decomposition step 

As the first step of the proof of Theorem 1, we use the
covariance decomposition formula (13) to write

$$\operatorname{Cov}(h_{\mathrm{out},1}, h_{\mathrm{out},2}) = \mathbb{E}_{\mathbf{Y}} \operatorname{Cov}_{\mathbf{X}|\mathbf{Y}}(h_{\mathrm{out},1}, h_{\mathrm{out},2})
+ \operatorname{Cov}_{\mathbf{Y}}(\mathbb{E}_{\mathbf{X}|\mathbf{Y}} h_{\mathrm{out},1}, \mathbb{E}_{\mathbf{X}|\mathbf{Y}} h_{\mathrm{out},2}). \tag{15}$$

For brevity, we will use the notation

$$\mathrm{Cov}_1 = \mathbb{E}_{\mathbf{Y}} \operatorname{Cov}_{\mathbf{X}|\mathbf{Y}}(h_{\mathrm{out},1}, h_{\mathrm{out},2}),$$

$$\mathrm{Cov}_2 = \operatorname{Cov}_{\mathbf{Y}}(\mathbb{E}_{\mathbf{X}|\mathbf{Y}} h_{\mathrm{out},1}, \mathbb{E}_{\mathbf{X}|\mathbf{Y}} h_{\mathrm{out},2})$$

to denote the two terms on the right hand side of (15). Our
proof of Theorem 1 will consist in proving the following two
statements.

Proposition 5. We have $\mathrm{Cov}_1 \ge 0$, with equality iff either
$(X_1, Y_1)$ or $(X_2, Y_2)$ is an erasing BDE.

Proposition 6. We have $\mathrm{Cov}_2 \ge 0$.

Remark 2. We note that $\mathrm{Cov}_2 = 0$ iff, of the two BDEs
$(X_1, Y_1)$ and $(X_2, Y_2)$, either one is extreme or both are pure.
We note this only for completeness but do not use it in the
paper.

The rest of the section is devoted to the proof of the above
propositions.


B. Proof of Proposition 5

For $p, q \in [0,1]$, define

$$f(p, q) = (p * \bar{q})(p * q) \log\left(\frac{p * \bar{q}}{p * q}\right)
\left[\mathcal{H}\left(\frac{p\bar{q}}{p * q}\right) - \mathcal{H}\left(\frac{pq}{p * \bar{q}}\right)\right]. \tag{16}$$

We will soon give a formula for $\mathrm{Cov}_1$ in terms of this function.
First, a number of properties of $f(p, q)$ will be listed. The
following symmetry properties are immediate:

$$f(p, q) = f(\bar{p}, q) = f(p, \bar{q}) = f(\bar{p}, \bar{q}), \tag{17}$$

$$f(p, q) = f(q, p). \tag{18}$$

Lemma 4. We have $f(p, q) \ge 0$ for all $p, q \in [0,1]$ with
equality iff $p \in \{0, 1/2, 1\}$ or $q \in \{0, 1/2, 1\}$.

Proof: We use (17) to write

$$f(p, q) = f(r, s) \tag{19}$$

where $r = \min\{p, \bar{p}\}$ and $s = \min\{q, \bar{q}\}$. Thus, instead of
proving $f(p, q) \ge 0$, it suffices to prove $f(r, s) \ge 0$ for $0 \le
r, s \le 1/2$. In fact, using (18), it suffices to prove $f(r, s) \ge 0$
for $0 \le r \le s \le 1/2$. Assuming $0 \le r \le s \le 1/2$, it is
straightforward to show that

$$r * \bar{s} \ge r * s \quad \text{and} \quad \frac{rs}{r * \bar{s}} \le \frac{r\bar{s}}{r * s} \le \frac{1}{2}. \tag{20}$$

Thus, if we write out the expression for $f(r, s)$, as in (16)
with $(r, s)$ in place of $(p, q)$, we can see easily that each of
the four factors on the right hand side of that expression is
non-negative. More specifically, the logarithmic term is
non-negative due to the first inequality in (20) and the bracketed
term is non-negative due to the second inequality in (20). This
completes the proof that $f(p, q) \ge 0$ for all $p, q \in [0,1]$.

Next, we identify the necessary and sufficient conditions for
$f(p, q)$ to be zero over $0 \le p, q \le 1$. Clearly, $f(p, q) = 0$
iff one of the four factors on the right hand side of (16)
equals zero. By straightforward algebra, one can verify the
following statements. The first factor $p * \bar{q}$ equals zero iff
$(p, q) \in \{(0,1), (1,0)\}$. The second factor $p * q$ equals zero iff
$(p, q) \in \{(0,0), (1,1)\}$. The log term equals zero iff $p = 1/2$
or $q = 1/2$. Finally, the difference of the entropy terms equals
zero iff $p\bar{q}/(p * q) = pq/(p * \bar{q})$ or $p\bar{q}/(p * q) = 1 - pq/(p * \bar{q})$, which
in turn is true iff $p \in \{0, 1/2, 1\}$ or $q \in \{0, 1/2, 1\}$. Taking
the logical combination of these conditions we conclude that
$f(p, q) = 0$ iff $p \in \{0, 1/2, 1\}$ or $q \in \{0, 1/2, 1\}$. ■
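The properties of $f$ stated in (17), (18) and Lemma 4 can be explored numerically. The Python sketch below is an illustration under stated assumptions: it uses $p * q = p\bar{q} + \bar{p}q$, takes $f$ in the form $(p * \bar{q})(p * q) \log\frac{p * \bar{q}}{p * q}\,[\mathcal{H}(p\bar{q}/(p * q)) - \mathcal{H}(pq/(p * \bar{q}))]$, and adopts the convention (introduced here) that $f = 0$ whenever one of the convolution factors vanishes. The function names are hypothetical.

```python
import math


def conv(p, q):
    """Convolution p * q = p(1 - q) + (1 - p)q."""
    return p * (1 - q) + (1 - p) * q


def binary_entropy(x):
    """Binary entropy in bits; the argument is clamped to [0, 1]."""
    x = min(max(x, 0.0), 1.0)
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def f(p, q):
    """The function f(p, q) of (16), under the bar placement assumed above."""
    c0 = conv(p, 1 - q)   # p * q-bar
    c1 = conv(p, q)       # p * q
    if c0 <= 0.0 or c1 <= 0.0:
        return 0.0        # convention: a vanishing factor gives f = 0
    bracket = binary_entropy(p * (1 - q) / c1) - binary_entropy(p * q / c0)
    return c0 * c1 * math.log2(c0 / c1) * bracket


def min_on_grid(n=20):
    """Smallest value of f over the grid {0, 1/n, ..., 1}^2."""
    return min(f(i / n, j / n) for i in range(n + 1) for j in range(n + 1))
```

On a grid, `min_on_grid()` should be zero up to rounding, and `f` should vanish (again up to rounding) along the lines $p \in \{0, 1/2, 1\}$ and $q \in \{0, 1/2, 1\}$.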

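Lemma 5. We have

$$\mathrm{Cov}_1 = \mathbb{E} f(\mathbf{A}) = \mathbb{E} f(\mathbf{B}). \tag{21}$$

Proof: Fix a sample $\mathbf{y} = (y_1, y_2)$. Note that

$$\operatorname{Cov}_{\mathbf{X}|\mathbf{y}}(h_{\mathrm{out},1}, h_{\mathrm{out},2}) = \operatorname{Cov}_{\mathbf{X}|\mathbf{y}}(h(U_1|\mathbf{y}), h(U_2|U_1, \mathbf{y}))$$

$$= \mathbb{E}_{\mathbf{X}|\mathbf{y}}\left\{[h(U_1|\mathbf{y}) - H(U_1|\mathbf{y})]\, h(U_2|U_1, \mathbf{y})\right\}$$

$$= \sum_{u_1} p_{U_1|\mathbf{Y}}(u_1|\mathbf{y})\, [h(u_1|\mathbf{y}) - H(U_1|\mathbf{y})]\, H(U_2|u_1, \mathbf{y}).$$

After some algebra, the term $[h(u_1|\mathbf{y}) - H(U_1|\mathbf{y})]$ simplifies
to

$$(1 - p_{U_1|\mathbf{Y}}(u_1|\mathbf{y})) \log \frac{1 - p_{U_1|\mathbf{Y}}(u_1|\mathbf{y})}{p_{U_1|\mathbf{Y}}(u_1|\mathbf{y})}.$$

Substituting this in the preceding equation and writing out the
sum over $u_1$ explicitly, we obtain

$$\operatorname{Cov}_{\mathbf{X}|\mathbf{y}}(h_{\mathrm{out},1}, h_{\mathrm{out},2}) = p_{U_1|\mathbf{Y}}(0|\mathbf{y})\, p_{U_1|\mathbf{Y}}(1|\mathbf{y})
\cdot \log \frac{p_{U_1|\mathbf{Y}}(0|\mathbf{y})}{p_{U_1|\mathbf{Y}}(1|\mathbf{y})} \cdot [H(U_2|U_1 = 1, \mathbf{y}) - H(U_2|U_1 = 0, \mathbf{y})].$$

Expressing each factor on the right side of the above equation
in terms of $a_i = \alpha_i(y_i)$, $i = 1, 2$, we see that it equals
$f(a_1, a_2)$. Taking expectations, we obtain $\mathrm{Cov}_1 = \mathbb{E} f(\mathbf{A})$.
The alternative formula $\mathrm{Cov}_1 = \mathbb{E} f(\mathbf{B})$ follows from the fact
that $f(\mathbf{B}) = f(\mathbf{A})$ due to the symmetries (17). ■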

Proposition 5 now follows readily. We have $\mathrm{Cov}_1 \ge 0$
since $f(a_1, a_2) \ge 0$ for all $a_1, a_2 \in [0,1]$ by Lemma 4. By
the same lemma, strict positivity, $\mathbb{E} f(\mathbf{A}) > 0$, is possible iff
the events $A_1 \notin \{0, 1/2, 1\}$ and $A_2 \notin \{0, 1/2, 1\}$ can occur
simultaneously with non-zero probability, i.e., iff

$$P\left(A_1 \notin \{0, \tfrac{1}{2}, 1\}\right) P\left(A_2 \notin \{0, \tfrac{1}{2}, 1\}\right) > 0, \tag{22}$$

since $A_1$ and $A_2$ are independent. Condition (22) is true iff

$$P\left(B_1 \notin \{0, \tfrac{1}{2}\}\right) P\left(B_2 \notin \{0, \tfrac{1}{2}\}\right) > 0, \tag{23}$$

which in turn is true iff neither $B_1$ nor $B_2$ is erasing. This
completes the proof of Proposition 5.


C. Proof of Proposition 6

Let $g_1(p, q) = \mathcal{H}(p * q)$ and $g_2(p, q) = \mathcal{H}(p) + \mathcal{H}(q) - \mathcal{H}(p *
q)$ for $p, q \in [0,1]$. These functions will be used to give an
explicit expression for $\mathrm{Cov}_2$. First, we note some symmetry
properties of the two functions. For $i = 1, 2$, we have

$$g_i(p, q) = g_i(\bar{p}, q) = g_i(p, \bar{q}) = g_i(\bar{p}, \bar{q}), \tag{24}$$

$$g_i(p, q) = g_i(q, p). \tag{25}$$

We omit the proofs since they are immediate.
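Lemma 6. We have, for $i = 1, 2$,

$$\mathbb{E}_{\mathbf{X}|\mathbf{Y}}\, h_{\mathrm{out},i} = g_i(\mathbf{A}) = g_i(\mathbf{B}). \tag{26}$$

Proof: These results follow from (6), (9), and (10). We
compute $\mathbb{E}_{\mathbf{X}|\mathbf{Y}}\, h_{\mathrm{out},1}$ as follows.

$$\mathbb{E}_{\mathbf{X}|\mathbf{Y}}\, h_{\mathrm{out},1} = \mathbb{E}_{\mathbf{U}|\mathbf{A}}\, h_{\mathrm{out},1} = \mathcal{H}(A_1 * A_2) = g_1(\mathbf{A}).$$

For the second term, we use the entropy conservation (5).

$$\mathbb{E}_{\mathbf{X}|\mathbf{Y}}\, h_{\mathrm{out},2} = \mathbb{E}_{\mathbf{X}|\mathbf{Y}}\, h_{\mathrm{in},1} + \mathbb{E}_{\mathbf{X}|\mathbf{Y}}\, h_{\mathrm{in},2} - \mathbb{E}_{\mathbf{X}|\mathbf{Y}}\, h_{\mathrm{out},1}
= \mathcal{H}(A_1) + \mathcal{H}(A_2) - \mathcal{H}(A_1 * A_2) = g_2(\mathbf{A}).$$

The second form of the formulas in terms of $\mathbf{B}$ follows from
the symmetry properties (24). ■

As a corollary to Lemma 6, we now have

$$\mathrm{Cov}_2 = \operatorname{Cov}[g_1(\mathbf{B}), g_2(\mathbf{B})]. \tag{27}$$

In order to prove that $\mathrm{Cov}_2 \ge 0$, we will apply Lemma 3 to
(27). First, we need to establish some monotonicity properties
of the functions $g_1$ and $g_2$. We insert here a general definition.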

Definition 1. A function $g : \mathbb{R}^n \to \mathbb{R}$ is called nondecreasing
if for all $x, y \in \mathbb{R}^n$, $g(x) \le g(y)$ whenever $x_i \le y_i$ for all
$i = 1, \ldots, n$.

Lemma 7. $g_1 : [0,1/2]^2 \to \mathbb{R}^+$ is nondecreasing.

Proof: Since $g_1(b_1, b_2) = g_1(b_2, b_1)$, it suffices to show
that $g_1(b_1, b_2)$ is nondecreasing in $b_1 \in [0,1/2]$ for fixed $b_2 \in
[0,1/2]$. So, fix $b_2 \in [0,1/2]$ and consider $g_1(b_1, b_2)$ as a
function of $b_1 \in [0,1/2]$. Recall the well-known facts that the
function $\mathcal{H}(p)$ over $p \in [0,1]$ is a strictly concave non-negative
function, symmetric around $p = 1/2$, attaining its minimum
value of 0 at $p \in \{0, 1\}$, and its maximum value of 1 at $p =
1/2$. It is readily verified that, for any fixed $b_2 \in [0,1/2]$, as $b_1$
ranges from 0 to 1/2, $b_1 * b_2$ increases from $b_2$ to 1/2, hence
$g_1(b_1, b_2) = \mathcal{H}(b_1 * b_2)$ increases from $\mathcal{H}(b_2)$ to $\mathcal{H}(1/2) = 1$,
with strict monotonicity if $b_2 \ne 1/2$. This completes the proof.
■

Lemma 8. $g_2 : [0,1/2]^2 \to \mathbb{R}^+$ is nondecreasing.

Proof: Again, since $g_2(b_1, b_2) = g_2(b_2, b_1)$, it suffices
to show that $g_2(b_1, b_2)$ is nondecreasing in $b_1 \in [0,1/2]$ for
fixed $b_2 \in [0,1/2]$. Recall that $g_2(b_1, b_2) = \mathcal{H}(b_1) + \mathcal{H}(b_2) -
\mathcal{H}(b_1 * b_2)$. Exclude the constant term $\mathcal{H}(b_2)$ and focus on the
behavior of $I(b_1) = \mathcal{H}(b_1 * b_2) - \mathcal{H}(b_1)$ over $b_1 \in [0,1/2]$,
so that $g_2(b_1, b_2) = \mathcal{H}(b_2) - I(b_1)$.
Observe that $I(b_1)$ is the mutual information between the input
and output terminals of a BSC with crossover probability $b_1$
and a Bernoulli-$b_2$ input. The mutual information between
the input and output of a discrete memoryless channel is a
convex function of the set of channel transition probabilities
for any fixed input probability assignment [9, p. 90]. So,
$I(b_1)$ is convex in $b_1 \in [0,1/2]$. Since $I(0) = \mathcal{H}(b_2)$ and
$I(1/2) = 0$, it follows from the convexity property that $I(b_1)$
is decreasing in $b_1 \in [0,1/2]$, and strictly decreasing if $b_2 \ne 0$.
This completes the proof. ■
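Lemmas 7 and 8 lend themselves to a simple grid check. The following Python sketch is illustrative only and does not replace the proofs; the function names and the grid resolution are assumptions made here. It evaluates $g_1$ and $g_2$ on a grid over $[0,1/2]^2$ and tests monotonicity in each coordinate, in the sense of Definition 1 restricted to neighbouring grid points.

```python
import math


def conv(p, q):
    """Convolution p * q = p(1 - q) + (1 - p)q."""
    return p * (1 - q) + (1 - p) * q


def binary_entropy(x):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def g1(p, q):
    return binary_entropy(conv(p, q))


def g2(p, q):
    return binary_entropy(p) + binary_entropy(q) - binary_entropy(conv(p, q))


def is_nondecreasing_on_grid(g, n=50, tol=1e-12):
    """Check g(b1, b2) <= g(b1', b2') for neighbouring grid points in [0, 1/2]^2."""
    grid = [i / (2 * n) for i in range(n + 1)]    # 0, 1/(2n), ..., 1/2
    for i in range(n):
        for b in grid:
            if g(grid[i], b) > g(grid[i + 1], b) + tol:
                return False                       # fails in the first coordinate
            if g(b, grid[i]) > g(b, grid[i + 1]) + tol:
                return False                       # fails in the second coordinate
    return True
```

Both `is_nondecreasing_on_grid(g1)` and `is_nondecreasing_on_grid(g2)` should return `True`; running the same check over $[0,1]^2$ instead would fail, which reflects why the β-representation is used here.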

Proposition 6 can now be proved as follows. First, we apply
Lemma 2 to (27) to decompose $\mathrm{Cov}_2$ as

$$\operatorname{Cov}(g_1(\mathbf{B}), g_2(\mathbf{B})) = \mathbb{E}_{B_1} \operatorname{Cov}_{B_2}(g_1(\mathbf{B}), g_2(\mathbf{B}))
+ \operatorname{Cov}_{B_1}(\mathbb{E}_{B_2}\, g_1(\mathbf{B}), \mathbb{E}_{B_2}\, g_2(\mathbf{B})).$$

Each covariance term on the right side is non-negative by
Chebyshev’s correlation inequality (Lemma 3) and the fact that
$g_1$ and $g_2$ are nondecreasing in the sense of Def. 1. More
specifically, Chebyshev’s inequality implies that

$$\operatorname{Cov}_{B_2}(g_1(b_1, B_2), g_2(b_1, B_2)) \ge 0$$

for any fixed $b_1 \in [0,1/2]$ since $g_1(b_1, b_2)$ and $g_2(b_1, b_2)$
are nondecreasing functions of $b_2$ when $b_1$ is fixed. Likewise,
Chebyshev’s inequality implies that

$$\operatorname{Cov}_{B_1}(\mathbb{E}_{B_2}\, g_1(\mathbf{B}), \mathbb{E}_{B_2}\, g_2(\mathbf{B})) \ge 0$$

since $\mathbb{E}_{B_2}\, g_1(b_1, B_2)$ and $\mathbb{E}_{B_2}\, g_2(b_1, B_2)$ are, as a simple
consequence of Lemmas 7 and 8, nondecreasing functions of $b_1$.

D. Proof of Theorem 1′

The covariance inequality (4) is an immediate consequence
of (15) and Propositions 5 and 6. We only need to identify
the necessary and sufficient conditions for the covariance to
be zero. For brevity, let us define

$$T = \text{“}B_1 \text{ or } B_2 \text{ is extreme”}.$$

The present goal is to prove that

$$\operatorname{Cov}(h_{\mathrm{out},1}, h_{\mathrm{out},2}) = 0 \text{ iff } T \text{ holds.} \tag{28}$$

The proof will make use of the decomposition

$$\operatorname{Cov}(h_{\mathrm{out},1}, h_{\mathrm{out},2}) = \mathrm{Cov}_1 + \mathrm{Cov}_2
= \mathbb{E} f(\mathbf{B}) + \operatorname{Cov}(g_1(\mathbf{B}), g_2(\mathbf{B})) \tag{29}$$

that we have already established. Let us define

$$R = \text{“}B_1 \text{ or } B_2 \text{ is erasing”}$$

and note that $R$ appears in Proposition 5 as the necessary
and sufficient condition for $\mathrm{Cov}_1$ to be zero. Note also that
$T$ implies $R$ since “extreme” is a special instance of “erasing”
according to the definitions in Table I.

We begin the proof of (28) with the sufficiency part, in
other words, by assuming that T holds. Since T implies R,
T is sufficient for $\mathrm{Cov}_1 = 0$. To show that T is sufficient
for $\mathrm{Cov}_2 = 0$, we recall Proposition [?], which states that, if
T is true, then either $B_{\text{out},1}$ or $B_{\text{out},2}$ is extreme. To be more
specific, if $B_{\text{in},1}$ or $B_{\text{in},2}$ is p.r., then $B_{\text{out},1} = 1/2$ and $g_1(B) = 1$;
if $B_{\text{in},1}$ or $B_{\text{in},2}$ is perfect, then $B_{\text{out},2} = 0$ and $g_2(B) = 0$.
(The notation “=” should be read as “equals with probability
one”.) In either case, $\mathrm{Cov}_2 = \mathrm{Cov}(g_1(B), g_2(B)) = 0$. This
completes the proof of the sufficiency part.

To prove necessity in (28), we write T as

$$T = R \wedge (R^c \vee T) \tag{30}$$

where $R^c$ denotes the complement (negation) of R. The
validity of (30) follows from $R \wedge T = T$. To prove necessity,
we will use contraposition and show that $T^c$ implies
$\mathrm{Cov}(h_{\text{out},1}, h_{\text{out},2}) > 0$. Note that $T^c = R^c \vee (R \wedge T^c)$. If
$T^c$ is true, then either $R^c$ or $(R \wedge T^c)$ is true. If $R^c$ is true,
then $\mathrm{Cov}_1 > 0$ by Proposition [?]. We will complete the proof
by showing that $R \wedge T^c$ implies $\mathrm{Cov}(h_{\text{out},1}, h_{\text{out},2}) > 0$. For
this, we note that when one of the BDEs is erasing, there is
an explicit formula for $\mathrm{Cov}_2$. We state this result as follows.

Lemma 9. Let $B_1$ be erasing with erasure probability $\epsilon =
P(B_1 = 1/2)$ and let $B_2$ be arbitrary with $\delta = H(X_2|Y_2)$.
Then,

$$\mathrm{Cov}_2 = \epsilon(1 - \epsilon)\,\delta(1 - \delta). \tag{31}$$

This formula remains valid if $B_2$ is erasing with erasure
probability $\epsilon = P(B_2 = 1/2)$ and $B_1$ is arbitrary with
$\delta = H(X_1|Y_1)$.


Proof: We first observe that

$$g_1(B_1, B_2) = \begin{cases} \mathcal{H}(B_2), & B_1 = 0; \\ 1, & B_1 = 1/2; \end{cases} \qquad g_2(B_1, B_2) = \begin{cases} 0, & B_1 = 0; \\ \mathcal{H}(B_2), & B_1 = 1/2. \end{cases}$$

Now, the claim (31) is obtained by simply computing the
covariance of these two random variables. The second claim
follows by the symmetry property (25). ■
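The covariance computation in this proof is short enough to check mechanically. The sketch below assumes $B_2$ is discrete, given as a list of `(value, probability)` pairs, and compares the covariance computed from the two-case description of $g_1$ and $g_2$ with the closed form (31). The function names are hypothetical and serve only this illustration.

```python
import math

def binary_entropy(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def cov2_erasing(eps, dist2):
    """Cov(g1, g2) when B1 = 1/2 with probability eps and B1 = 0 otherwise."""
    e_g1 = e_g2 = e_g1g2 = 0.0
    for b2, p2 in dist2:
        h2 = binary_entropy(b2)
        # B1 = 0:   g1 = H(B2), g2 = 0
        # B1 = 1/2: g1 = 1,     g2 = H(B2)
        for p1, v1, v2 in ((1.0 - eps, h2, 0.0), (eps, 1.0, h2)):
            e_g1 += p1 * p2 * v1
            e_g2 += p1 * p2 * v2
            e_g1g2 += p1 * p2 * v1 * v2
    return e_g1g2 - e_g1 * e_g2

def cov2_formula(eps, dist2):
    """Closed form (31): eps(1 - eps) delta(1 - delta), delta = E[H(B2)]."""
    delta = sum(p2 * binary_entropy(b2) for b2, p2 in dist2)
    return eps * (1.0 - eps) * delta * (1.0 - delta)
```

The two functions should agree for every `eps` in $[0,1]$. The covariance is strictly positive exactly when $0 < \epsilon < 1$ and $0 < \delta < 1$, which is the case used in the necessity part of the proof.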

Returning to the proof of Theorem 1, the proof of the
necessity part is now completed as follows. If $R \wedge T^c$ holds,
then at least one of the BDEs is strictly erasing (has erasure
probability $0 < \epsilon < 1$) and the other is non-extreme. By
Proposition [?], the conditional entropy $H(X|Y)$ of a non-extreme
BDE $(X,Y)$ is strictly between 0 and 1. So, by
Lemma 9, we have $\mathrm{Cov}_2 > 0$. This completes the proof.


V. Varentropy under higher-order transforms 
In this part, we consider the behavior of varentropy under
higher-order polar transforms. The section concludes with a
proof of the polarization theorem using properties of varentropy.


A. Polar transform of higher orders 

For any $n \ge 1$, there is a polar transform of order $N = 2^n$.
A polar transform of order $N = 2^n$ is a mapping $\psi_N$ that takes
$N$ BDEs $\{(X_i, Y_i)\}_{i=1}^{N}$ as input, and produces a new set of
$N$ BDEs $\{(U_i; U^{i-1}, Y)\}_{i=1}^{N}$ where $Y = (Y_1, \ldots, Y_N)$ and
$U^{i-1} = (U_1, \ldots, U_{i-1})$ is a subvector of $U = (U_1, \ldots, U_N)$,
which in turn is obtained from $X = (X_1, \ldots, X_N)$ by the
transform

$$U = X G_N, \qquad G_N = F^{\otimes n}, \qquad F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}. \tag{32}$$

The sign “$\otimes n$” in the exponent denotes the $n$th Kronecker
power. We allow $Y_i$ to take values in some arbitrary set $\mathcal{Y}_i$,
$1 \le i \le N$, which is not necessarily discrete. We assume
that $(X_i, Y_i)$, $1 \le i \le N$, are independent but not necessarily
identically-distributed.
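As a concrete illustration of the transform in (32), the sketch below builds $G_N$ as a Kronecker power and applies it to a bit vector. It assumes the usual polar-coding conventions: the kernel $F$ has rows $(1,0)$ and $(1,1)$, and the product $XG_N$ is taken modulo 2. The function names are illustrative, and the code makes no attempt at efficiency.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def polar_matrix(n):
    """G_N = n-th Kronecker power of F = [[1, 0], [1, 1]], with N = 2**n."""
    F = [[1, 0], [1, 1]]
    G = [[1]]
    for _ in range(n):
        G = kron(F, G)
    return G

def polar_transform(x):
    """U = X G_N with arithmetic modulo 2, for a bit list x of length 2**n."""
    N = len(x)
    n = N.bit_length() - 1
    if N < 1 or N != 1 << n:
        raise ValueError("length must be a power of two")
    G = polar_matrix(n)
    return [sum(x[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]
```

For $N = 2$ this gives $U_1 = X_1 \oplus X_2$ and $U_2 = X_2$. Since $F^2$ is the identity modulo 2, applying `polar_transform` twice returns the original vector, which gives a convenient self-check.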

(An alternate form of the polar transform matrix, as used
in [?], is $G_N = B_N F^{\otimes n}$, in which $B_N$ is a permutation
matrix known as bit-reversal. The form of $G_N$ that we are
using here is less complex and adequate for the purposes of
this paper. However, if desired, the results given below can
be proved under bit-reversal (or any other permutation) after
suitable re-indexing of variables.)


B. Polarization results 

The first result in this section is a generalization of Theorem 1
to higher order polar transforms.

Theorem 2. Let $N = 2^n$ for some $n \ge 1$. Let $(X_i, Y_i)$,
$1 \le i \le N$, be independent but not necessarily identically
distributed BDEs. Consider the polar transform $U = XG_N$
and let $(U_i; U^{i-1}, Y)$, $1 \le i \le N$, be the BDEs at the output
of the polar transform. The varentropy is nonincreasing under
any such polar transform in the sense that

$$\sum_{i=1}^{N} V(U_i \mid U^{i-1}, Y) \le \sum_{i=1}^{N} V(X_i \mid Y_i). \tag{33}$$

The next result considers the special case in which the BDEs 
at the input of the polar transform are i.i.d. and the transform 
size goes to infinity. 

Theorem 3. Let $(X_i, Y_i)$, $1 \le i \le N$, be i.i.d. copies of a
given BDE $(X, Y)$. Consider the polar transform $U = XG_N$
and let $(U_i; U^{i-1}, Y)$, $1 \le i \le N$, be the BDEs at the output
of the polar transform. Then, the average varentropy at the
output goes to zero asymptotically:

$$\frac{1}{N} \sum_{i=1}^{N} V(U_i \mid U^{i-1}, Y) \to 0, \quad \text{as } N \to \infty. \tag{34}$$
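Theorems 2 and 3 can be observed exactly in one simple special case. The sketch below assumes every input BDE is erasing with the same erasure probability $\epsilon$. It relies on two facts that are not stated in this section: the standard erasure recursion $\epsilon \mapsto (2\epsilon - \epsilon^2, \epsilon^2)$ for the two outputs of an order-two transform, and the fact that the entropy random variable of an erasing BDE takes the value 1 with probability $\epsilon$ and 0 otherwise, so that its varentropy is $\epsilon(1-\epsilon)$. The function names are hypothetical.

```python
def erasure_generations(eps, n_max):
    """Erasure probabilities of the 2**n output BDEs for n = 0, ..., n_max,
    when all inputs are erasing with erasure probability eps."""
    gens = [[eps]]
    for _ in range(n_max):
        prev = gens[-1]
        # "minus" output: erased if either input is erased;
        # "plus" output: erased only if both inputs are erased.
        gens.append([2 * e - e * e for e in prev] + [e * e for e in prev])
    return gens

def average_varentropy(erasures):
    """Average of e(1 - e) over a generation of erasing BDEs."""
    return sum(e * (1.0 - e) for e in erasures) / len(erasures)

# Average varentropy per generation for eps = 1/2, up to N = 2**12.
averages = [average_varentropy(g) for g in erasure_generations(0.5, 12)]
```

The list `averages` starts at 0.25 and should be nonincreasing, consistent with Theorem 2, and it shrinks toward zero as the generation index grows, consistent with Theorem 3. The ordering of outputs within a generation is irrelevant here because only the average is used.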

C. Proof of Theorem 2

We will first bring out the recursive nature of the polar
transform by giving a more abstract formulation in terms of
the $\alpha$-parameters of the variables involved. Let us recall that
a polar transform of order two is essentially a mapping of the
form

$$(A_{\text{in},1}, A_{\text{in},2}) \to (A_{\text{out},1}, A_{\text{out},2}), \tag{35}$$

where $A_{\text{in},1}$ and $A_{\text{in},2}$ are the $\alpha$-parameters of the input
BDEs $(X_1, Y_1)$ and $(X_2, Y_2)$, and $A_{\text{out},1}$ and $A_{\text{out},2}$ are the
$\alpha$-parameters of the output BDEs $(U_1; Y)$ and $(U_2; U_1, Y)$.

Alternatively, the polar transform may be viewed as an operation
in the space of CDFs of $\alpha$-parameters and represented
in the form

$$(F_{\text{out},1}, F_{\text{out},2}) = \psi_2(F_{\text{in},1}, F_{\text{in},2}) \tag{36}$$

where $F_{\text{in},i}$ and $F_{\text{out},i}$ are the CDFs of $A_{\text{in},i}$ and $A_{\text{out},i}$,
respectively.

Let $\mathcal{M}$ be the space of all CDFs belonging to random
variables defined on the interval $[0,1]$. The CDF of any
$\alpha$-parameter $A$ belongs to $\mathcal{M}$, and conversely, each CDF $F \in \mathcal{M}$
defines a valid $\alpha$-parameter $A$. Thus, we may regard the polar
transform of order two (36) as an operator of the form

$$\psi_2 : \mathcal{M}^2 \to \mathcal{M}^2. \tag{37}$$

We will define higher order polar transforms following this 
viewpoint. 

For each $i = 1, \ldots, N$, let $A_{\text{in},i}$ denote the $\alpha$-parameter
of the $i$th BDE $(X_i, Y_i)$ at the input, and let $F_{\text{in},i}$ denote the
CDF of $A_{\text{in},i}$. Likewise, let $A_{\text{out},i}$ denote the $\alpha$-parameter of
the $i$th BDE $(U_i; U^{i-1}, Y)$ at the output, and let $F_{\text{out},i}$ be
the CDF of $A_{\text{out},i}$. Let $F_{\text{in}} = (F_{\text{in},1}, \ldots, F_{\text{in},N})$ and $F_{\text{out}} =
(F_{\text{out},1}, \ldots, F_{\text{out},N})$. We will represent a polar transform of
order $N$ abstractly as $F_{\text{out}} = \psi_N(F_{\text{in}})$.

There is a recursive formula that defines the polar transform
of order $N$ in terms of the polar transform of order $N/2$. Let
us split the output $F_{\text{out}}$ into two halves as $F_{\text{out}} = (F'_{\text{out}}, F''_{\text{out}})$.
Each half is obtained by a size-$N/2$ transform of the form

$$F'_{\text{out}} = \psi_{N/2}(F'_{\text{in}}), \qquad F''_{\text{out}} = \psi_{N/2}(F''_{\text{in}}),$$

in which $F'_{\text{in}} = (F'_{\text{in},1}, \ldots, F'_{\text{in},N/2})$ and $F''_{\text{in}} = (F''_{\text{in},1}, \ldots, F''_{\text{in},N/2})$
are obtained from $F_{\text{in}}$ through a series of size-2 transforms

$$(F'_{\text{in},i}, F''_{\text{in},i}) = \psi_2(F_{\text{in},i}, F_{\text{in},i+N/2}), \qquad 1 \le i \le N/2. \tag{38}$$

The derivation of the above recursion from the algebraic
definition (32) is standard knowledge in polar coding, and will
be omitted.
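The recursion (38) can be sketched for discrete distributions. In the code below, a distribution of the $\alpha$-parameter is a dictionary `{a: probability}`, and $a$ is taken to mean $P(X = 1 \mid y)$; this convention is an assumption of the sketch, as are the function names. The update rules in `psi2` follow from Bayes' rule applied to $U_1 = X_1 \oplus X_2$, $U_2 = X_2$. The sketch does not track which output index corresponds to which BDE beyond the split into halves used in (38).

```python
import math

def binary_entropy(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def psi2(F1, F2):
    """Order-two transform on discrete alpha-parameter distributions.
    Returns (F_minus, F_plus) for inputs F1, F2 given as {a: prob}."""
    minus, plus = {}, {}
    for a1, p1 in F1.items():
        for a2, p2 in F2.items():
            w1 = a1 * (1.0 - a2) + (1.0 - a1) * a2  # P(U1 = 1 | y)
            w0 = 1.0 - w1                            # P(U1 = 0 | y)
            minus[w1] = minus.get(w1, 0.0) + p1 * p2
            if w0 > 0.0:  # branch U1 = 0: P(X2 = 1 | y, U1 = 0)
                a = a1 * a2 / w0
                plus[a] = plus.get(a, 0.0) + p1 * p2 * w0
            if w1 > 0.0:  # branch U1 = 1: P(X2 = 1 | y, U1 = 1)
                a = (1.0 - a1) * a2 / w1
                plus[a] = plus.get(a, 0.0) + p1 * p2 * w1
    return minus, plus

def psi(Fs):
    """Order-N transform via recursion (38); Fs has length N = 2**n."""
    N = len(Fs)
    if N == 1:
        return list(Fs)
    half = N // 2
    pairs = [psi2(Fs[i], Fs[i + half]) for i in range(half)]
    return psi([m for m, _ in pairs]) + psi([p for _, p in pairs])

def cond_entropy(F):
    """H(X|Y) = E[H(A)] for a discrete alpha-parameter distribution."""
    return sum(p * binary_entropy(a) for a, p in F.items())
```

A useful check on such an implementation is conservation of conditional entropy: by the chain rule, the sum of `cond_entropy` over the outputs of `psi` should equal the sum over the inputs.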

Let us write $V(F)$ to denote the varentropy $V(X|Y)$ of a
BDE $(X, Y)$ whose $\alpha$-parameter has CDF $F$. Using (8), we
can write $V(F)$ as

$$V(F) = \int_0^1 \mathcal{H}_2(a)\, dF(a) - \left( \int_0^1 \mathcal{H}(a)\, dF(a) \right)^2. \tag{39}$$
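For a discrete distribution of the $\alpha$-parameter, the varentropy reduces to two finite sums. The sketch below assumes that the second-moment function is $\mathcal{H}_2(a) = a \log_2^2 a + (1-a)\log_2^2(1-a)$, which is what the identity "variance = second moment minus squared mean" requires for the random variable $-\log_2 p_{X|Y}(X|Y)$. It represents $F$ as a dictionary `{a: probability}`. The function names are illustrative.

```python
import math

def entropy_fn(a):
    """First moment of -log2 p(X|y) when P(X = 1 | y) = a (binary entropy)."""
    return sum(-q * math.log2(q) for q in (a, 1.0 - a) if q > 0.0)

def second_moment_fn(a):
    """Second moment of -log2 p(X|y): a log2(a)^2 + (1 - a) log2(1 - a)^2."""
    return sum(q * math.log2(q) ** 2 for q in (a, 1.0 - a) if q > 0.0)

def varentropy(F):
    """V(F) = E[H2(A)] - (E[H(A)])^2 for a discrete F = {a: prob}."""
    mean = sum(p * entropy_fn(a) for a, p in F.items())
    second = sum(p * second_moment_fn(a) for a, p in F.items())
    return second - mean ** 2
```

Two quick consistency checks: an extreme BDE (`{0.0: 1.0}` or `{0.5: 1.0}`) has zero varentropy, and an erasing BDE `{0.5: eps, 0.0: 1 - eps}` has varentropy `eps * (1 - eps)`.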

We are now ready to prove Theorem 2. The proof will be by
induction. First note that the claim (33) is true for $N = 2$ by
Theorem 1. Let $N \ge 4$ and suppose, as induction hypothesis,
that the claim is true for transforms of orders $N/2$ and smaller.
We will show that the claim is true for order $N$. By the
induction hypothesis, we have

$$\sum_{i=1}^{N/2} V(F'_{\text{out},i}) \le \sum_{i=1}^{N/2} V(F'_{\text{in},i}) \tag{40}$$

and

$$\sum_{i=1}^{N/2} V(F''_{\text{out},i}) \le \sum_{i=1}^{N/2} V(F''_{\text{in},i}). \tag{41}$$

Summing (40) and (41) side by side,

$$\sum_{i=1}^{N} V(F_{\text{out},i}) \le \sum_{i=1}^{N/2} \left[ V(F'_{\text{in},i}) + V(F''_{\text{in},i}) \right]. \tag{42}$$

Using the induction hypothesis again, we obtain

$$V(F'_{\text{in},i}) + V(F''_{\text{in},i}) \le V(F_{\text{in},i}) + V(F_{\text{in},i+N/2}) \tag{43}$$

for all $i = 1, \ldots, N/2$. The proof is completed by using (43)
to upper-bound the right side of (42) further.

D. Proof of Theorem 3

In this proof we will consider a sequence of polar transforms
indexed by $n \ge 1$. For a given $n$, the size of the transform is
$N = 2^n$; the inputs of the transform are $(X_i, Y_i)$, $1 \le i \le N$,
which are i.i.d. copies of a given BDE $(X, Y)$; the outputs of
the transform, which we will refer to as “the $n$th generation
BDEs”, are $(U_i; U^{i-1}, Y)$, $1 \le i \le N$. Let $F_0$ denote the
CDF of $(X, Y)$. Let $F_{n,i}$ denote the CDF of $(U_i; U^{i-1}, Y)$,
the $i$th BDE in the $n$th generation, $n \ge 1$, $1 \le i \le 2^n$, and
set $F_{0,1} = F_0$. In this notation, we can express the normalized
varentropy compactly as

$$V_n = \frac{1}{2^n} \sum_{i=1}^{2^n} V(U_i \mid U^{i-1}, Y) = \frac{1}{2^n} \sum_{i=1}^{2^n} V(F_{n,i}), \quad n \ge 1,$$

and $V_0 = V(F_0)$. The sequence $\{V_n\}$ is non-negative (since
each $V_n$ is an average of varentropies), and nonincreasing by
Theorem 2. Thus $\{V_n\}$ converges to a limit $c \ge 0$. Our goal
is to prove that $c = 0$.

The analysis in the proof of Theorem 2 covers the present
case as a special instance. In the present notation, the recursive
relation (38) takes the form

$$(F_{n,i}, F_{n,i+2^{n-1}}) = \psi_2(F_{n-1,i}, F_{n-1,i}), \qquad 1 \le i \le 2^{n-1},$$

since here we have $F_{n-1,i} = F_{n-1,i+2^{n-1}}$ due to i.i.d. BDEs
at the transform input. Using this relation, we obtain readily
an explicit formula for the incremental change in normalized
varentropy from generation $n$ to $(n + 1)$, namely,

$$D_{n+1} = V_{n+1} - V_n = -\frac{1}{2^n} \sum_{i=1}^{2^n} C(F_{n,i}), \tag{44}$$

where

$$C(F_{n,i}) = V(F_{n,i}) - \left[ V(F_{n+1,i}) + V(F_{n+1,i+2^n}) \right]/2. \tag{45}$$

If we denote the conditional entropy random variables in the
polar transform as $\{h_{n,i}\}$, it can be seen that

$$C(F_{n,i}) = \mathrm{Cov}(h_{n+1,i}, h_{n+1,i+2^n}).$$

Thus, we have $C(F_{n,i}) \ge 0$ by Theorem 1, implying that
$D_n \le 0$ for all $n \ge 1$. It is useful to note here that

$$c = \lim_{n \to \infty} V_n = V(F_0) + \sum_{n=1}^{\infty} D_n, \tag{46}$$

showing explicitly that $c$ is the limit of a monotone nonincreasing
sequence of sums.

For $\delta > 0$, let

$$\mathcal{M}_\delta = \{F \in \mathcal{M} : V(F) \ge \delta\}, \tag{47}$$

$$\Delta(\delta) = \inf\{C(F) : F \in \mathcal{M}_\delta\}. \tag{48}$$

As we will see in a moment, the main technical problem that
remains is to show that

$$\delta > 0 \implies \Delta(\delta) > 0. \tag{49}$$

While this proposition seems plausible in view of the fact
that $C(F) = 0$ iff $V(F) = 0$ (by Theorem 1), there is the
technical question of whether the “inf” in (48) is achieved as
a “min” by some $F \in \mathcal{M}_\delta$. We will first complete the proof
of Theorem 3 by assuming that (49) holds. Then, we will give
a proof of (49) in the Appendix.
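The claim (49) can be made concrete on a restricted family. The sketch below considers only erasing BDEs with erasure probability $\epsilon$. It assumes the standard erasure recursion $\epsilon \mapsto (2\epsilon - \epsilon^2, \epsilon^2)$ and the varentropy $\epsilon(1-\epsilon)$ of an erasing BDE, neither of which is stated in this section. Within this family, the quantity $C$ equals the square of the varentropy, so it is bounded below by $\delta^2$ whenever the varentropy is at least $\delta$. This illustrates the kind of bound that (49) asserts but says nothing about general CDFs; the function names are hypothetical.

```python
def varentropy_erasing(eps):
    """Varentropy of an erasing BDE with erasure probability eps."""
    return eps * (1.0 - eps)

def c_erasing(eps):
    """C(F) = V(F) - [V(F^-) + V(F^+)] / 2 for an erasing F."""
    minus = 2.0 * eps - eps * eps  # erased if either input is erased
    plus = eps * eps               # erased only if both inputs are erased
    return varentropy_erasing(eps) - 0.5 * (
        varentropy_erasing(minus) + varentropy_erasing(plus))
```

Expanding the expressions shows `c_erasing(eps)` equals `varentropy_erasing(eps) ** 2`, in agreement with Lemma 9 when both inputs are erasing with the same $\epsilon$ (so that $\delta = \epsilon$ there).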

Let $J_n(\delta) = \{1 \le i \le 2^n : F_{n,i} \in \mathcal{M}_\delta\}$, and $P_n(\delta) =
|J_n(\delta)|/2^n$. For $\delta > 0$, we may think of $J_n(\delta)$ as the set of
“bad” BDEs in the $n$th generation and $P_n(\delta)$ as their fraction
in the same population. From (44), we obtain the bound

$$D_n \le -P_n(\delta)\Delta(\delta), \qquad \delta > 0. \tag{50}$$

To apply this bound effectively, we need a lower bound on
$P_n(\delta)$. To derive such a lower bound, we observe that, for
any $\delta > 0$,

$$V_n \le [1 - P_n(\delta)]\delta + P_n(\delta)M \le \delta + P_n(\delta)M \tag{51}$$

where $M = 2.3434$ is the bound on varentropy provided by
Lemma [?]. Let $n_0$ be such that for all $n \ge n_0$, $V_n \ge c/2$.
Since $\{V_n\}$ converges to $c > 0$, $n_0$ exists and is finite.
This, combined with (51), implies the following bound on the
fraction of bad indices.

$$P_n(\delta) \ge \frac{V_n - \delta}{M} \ge \frac{c/2 - \delta}{M}, \qquad n \ge n_0. \tag{52}$$

Using (52) in (50) with $\delta = c/4$ gives

$$D_n \le -(c/4M) \cdot \Delta(c/4), \qquad n \ge n_0. \tag{53}$$

From (46), we see that having $c > 0$ is incompatible with (53):
a fixed negative upper bound on $D_n$ for all $n \ge n_0$ would make
the sum in (46) diverge to $-\infty$, whereas $V_n \ge 0$ for all $n$.
This completes the proof that $c = 0$ (subject to the assumption
that (49) holds, which is proved in the Appendix).


VI. Concluding remarks

One of the implications of the convergence of average
varentropy to zero is that the entropy random variables
“concentrate” around their means along almost all trajectories of
the polar transform. This concentration phenomenon provides
a theoretical basis for understanding why polar decoders are
robust against quantization of likelihood ratios [?].

Theorem 3 may be seen as an alternative version of the
“polarization” results of [?]. In [?], the analysis was centered
around the mutual information function and martingale methods
were used to establish asymptotic results. The present
study is centered around the varentropy and uses weak convergence
of probability distributions. The use of weak convergence
in such problems is not new; Richardson and Urbanke
[?, pp. 187-188] used similar methods to deal with problems of
convergence of functionals defined on the space of binary
memoryless channels.

We should mention that Alsan and Telatar [?] have given
an elementary proof of polarization that avoids martingale
theory, and instead, uses Mrs. Gerber's lemma [?]. It appears
possible to adopt the method of [?] to establish Theorem 3
without using weak convergence.

Appendix
Proof of (49)

Lemma 10. The space $\mathcal{M}$ of CDFs on $[0,1]$ is a compact
metric space.

Proof: This follows from a general result about probability
measures on compact metric spaces. Theorem 6.4 in [14,
p. 45] states that, for any compact metric space $X$, the space
$\mathcal{M}(X)$ of all probability measures defined on the $\sigma$-algebra
of Borel sets in $X$ is compact. Our definition of $\mathcal{M}$ above
coincides with $\mathcal{M}(X)$ with $X = [0,1]$. ■

For $F \in \mathcal{M}$, let $F^-$ and $F^+$ be defined by (see (37))

$$(F^-, F^+) = \psi_2(F, F).$$

Define $C : \mathcal{M} \to \mathbb{R}$ as the mapping

$$C(F) = V(F) - [V(F^-) + V(F^+)]/2. \tag{54}$$

This definition is a repetition of (45) in a more convenient
notation. We have already seen the interpretation of $C(F)$ as
a covariance and mentioned that $C(F) \ge 0$. It is also clear that
$C(F)$ is bounded: $C(F) \le V(F) \le M$, where $M = 2.3434$.
Thus, we may restrict the range of $C$ and write it as a mapping
$C : \mathcal{M} \to [0, M]$.

Lemma 11. The mapping $C : \mathcal{M} \to [0, M]$ is continuous
(w.r.t. the weak topology on $\mathcal{M}$ and the usual topology of
Borel sets in $\mathbb{R}$).

Proof: We wish to show that if $F_n \Rightarrow F_0$ (in the sense of
weak convergence), then $|C(F_n) - C(F_0)| \to 0$. We observe
from (39) that $V(F)$ is given in terms of expectations of two
bounded uniformly continuous functions, $\mathcal{H} : [0,1] \to [0,1]$
and $\mathcal{H}_2 : [0,1] \to [0, M]$. Thus, by definition of weak
convergence ([14, p. 40]), we have $|V(F_n) - V(F_0)| \to 0$.
In view of (54), the proof will be complete if we can show
that $(F_n \Rightarrow F_0)$ implies $(F_n^- \Rightarrow F_0^-)$ and $(F_n^+ \Rightarrow F_0^+)$, where
$F_n^- = (F_n)^-$, etc. By the “portmanteau” theorem (see, e.g.,
Theorem 6.1 in [14, p. 40]), it is sufficient to show that for
every open set $G \subset [0,1]$,


$$\liminf_n \int_G dF_n^-(a) \ge \int_G dF_0^-(a), \tag{55}$$

$$\liminf_n \int_G dF_n^+(a) \ge \int_G dF_0^+(a). \tag{56}$$

To prove (55), let $f_1 : [0,1]^2 \to [0,1]$ be such that
$f_1(a_1, a_2) = a_1 * a_2$. Then, we can write

$$P_n^-(G) = \int_G dF_n^-(a) = \iint_{f_1^{-1}(G)} dF_n(a_1)\, dF_n(a_2),$$

which follows from the density evolution equation

$$F_n^-(a) = \iint_{a_1 * a_2 \le a} dF_n(a_1)\, dF_n(a_2)$$

that was proved as part of Proposition [?]. We note that (i) the
pre-image $f_1^{-1}(G) \subset [0,1]^2$ is an open set since the function $f_1$ is
continuous and (ii) the product measure $F_n \times F_n$ converges
weakly to $F_0 \times F_0$ [13, p. 21, Thm. 3.2]; so, again by the
portmanteau theorem,

$$\liminf_n \iint_{f_1^{-1}(G)} dF_n(a_1)\, dF_n(a_2) \ge \iint_{f_1^{-1}(G)} dF_0(a_1)\, dF_0(a_2).$$

Since

$$\iint_{f_1^{-1}(G)} dF_0(a_1)\, dF_0(a_2) = \int_G dF_0^-(a),$$

the proof of (55) is complete.

The second condition (56) can be proved in a similar
manner. We will sketch the steps of the proof but leave out the
details. Writing $\bar{a} = 1 - a$, the relevant form of the density
evolution equation is now

$$F_n^+(a) = \iint_{a_1 a_2/(\bar{a}_1 * a_2) \le a} (\bar{a}_1 * a_2)\, dF_n(a_1)\, dF_n(a_2) + \iint_{\bar{a}_1 a_2/(a_1 * a_2) \le a} (a_1 * a_2)\, dF_n(a_1)\, dF_n(a_2).$$

We define $f_{21}(a_1, a_2) = a_1 a_2/(\bar{a}_1 * a_2)$ and $f_{22}(a_1, a_2) =
\bar{a}_1 a_2/(a_1 * a_2)$, and write

$$P_n^+(G) = \int_G dF_n^+(a) = \iint_{f_{21}^{-1}(G)} (\bar{a}_1 * a_2)\, dF_n(a_1)\, dF_n(a_2) + \iint_{f_{22}^{-1}(G)} (a_1 * a_2)\, dF_n(a_1)\, dF_n(a_2).$$

Next, we note that, by a general result on the preservation of
weak convergence [13, Thm. 5.1],

$$(\bar{a}_1 * a_2)\, dF_n(a_1)\, dF_n(a_2) \Rightarrow (\bar{a}_1 * a_2)\, dF_0(a_1)\, dF_0(a_2),$$

$$(a_1 * a_2)\, dF_n(a_1)\, dF_n(a_2) \Rightarrow (a_1 * a_2)\, dF_0(a_1)\, dF_0(a_2).$$

(The important point here is that the functions $(\bar{a}_1 * a_2)$ and
$(a_1 * a_2)$ are uniformly continuous and bounded over the
domain $(a_1, a_2) \in [0,1]^2$. The claimed convergences follow
readily from the definition of weak convergence.) The proof
is completed by writing

$$\liminf_n P_n^+(G) \ge \iint_{f_{21}^{-1}(G)} (\bar{a}_1 * a_2)\, dF_0(a_1)\, dF_0(a_2) + \iint_{f_{22}^{-1}(G)} (a_1 * a_2)\, dF_0(a_1)\, dF_0(a_2) = \int_G dF_0^+(a).$$


Lemma 12. For $\delta > 0$, $\Delta(\delta) > 0$.

Proof: Fix $\delta > 0$. The set $\mathcal{M}_\delta$ can be written as the
pre-image of a closed set under a continuous function: $\mathcal{M}_\delta =
V^{-1}([\delta, M])$. Hence, by a general result about continuity ([16,
4.8]), $\mathcal{M}_\delta$ is closed; and, being a subset of the compact set
$\mathcal{M}$ (Lemma 10), it is compact ([16, 2.35]). Since $C$ is continuous
(Lemma 11) and $\mathcal{M}_\delta$ is compact, the “inf” in (48) is achieved by
some $F_\delta \in \mathcal{M}_\delta$ ([16, 4.16]): $\Delta(\delta) = C(F_\delta)$. Since
$V(F_\delta) \ge \delta > 0$, $F_\delta$ is not extreme, so by Theorem 1, $C(F_\delta) > 0$. ■